Exploratory Data Analysis

By: Chinmay Jain (ST1128), Manisha Mehta (ST1147), Sarthak Priyank Verma (ST1170), Issac (ST1137)

Data Set: IMDB Movie Rating

Source: https://www.kaggle.com/datasets/carolzhangdc/imdb-5000-movie-dataset

Background: A commercially successful movie not only entertains its audience but also earns film companies substantial profit. Many factors, such as good directors and experienced actors, contribute to making a good movie. However, while famous directors and actors can usually be expected to bring in box-office income, they cannot guarantee a high IMDb score.

Problem statement: A renowned film production company has partnered with Mu Sigma to gain a comprehensive understanding of the key factors that contribute to the success of movies. The primary goal is to leverage an in-depth analysis of IMDb ratings to identify patterns, insights, and actionable recommendations that can enhance the director’s future film projects and overall success in the highly competitive film industry.

Expected Outcomes: Identifying the key factors that influence IMDb rating. This can be done by performing correlation analysis on the IMDb dataset to identify the factors that are most strongly correlated with IMDb rating.

Understanding the relationships between different factors and IMDb rating. Mu Sigma can use analysis to model the relationship between IMDb rating and other movie features. This can help to understand how different factors impact IMDb rating, and to identify the factors that are most important for predicting the success of a movie.

Segmenting movies into different groups based on their features. This can help to identify different types of movies and to understand the characteristics of each type of movie. This information can be used to develop more targeted marketing and distribution strategies for different types of movies.
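The correlation analysis mentioned under Expected Outcomes can be sketched as follows. This is only an illustration: the toy data frame reuses the first four rows shown later by head(IMDB), which is far too few rows to draw conclusions from, and the object names toy and cor_with_score are assumptions.

```r
# Toy data: first four rows of three numeric columns, as printed by head(IMDB)
toy <- data.frame(
  imdb_score      = c(7.9, 7.1, 6.8, 8.5),
  num_voted_users = c(886204, 471220, 275868, 1144337),
  duration        = c(178, 169, 148, 164)
)

# Pearson correlation of every numeric column with imdb_score
cor_with_score <- cor(toy)[, "imdb_score"]

# Strongest positive correlations first
sort(cor_with_score, decreasing = TRUE)
```

On the full dataset the same call would be applied to the numeric columns only, after missing values have been handled.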

Target Variables: IMDb Score, Profit, Facebook Likes

Column descriptions (metadata)
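Profit is not a column in the dataset. A minimal sketch of deriving it, assuming profit is defined as gross minus budget (the report does not state the definition at this point). The two rows reuse the gross and budget values printed by head(IMDB) further down; the column name profit is an assumption.

```r
library(dplyr)

# Two rows taken from the head(IMDB) output
toy <- data.frame(
  movie_title = c("Avatar", "Pirates of the Caribbean: At World's End"),
  gross       = c(760505847, 309404152),
  budget      = c(237000000, 300000000)
)

# Assumed definition: profit = gross - budget
toy <- toy %>%
  mutate(profit = gross - budget)

toy
```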

color: This column indicates whether the movie is in color (e.g., “Color” or “Black and White”).

director_name: The name of the movie’s director, the person responsible for overseeing the creative aspects of the film.

num_critic_for_reviews: The number of critic reviews the movie has received, which can provide insight into its critical reception.

duration: The duration of the movie in minutes, indicating its length.

director_facebook_likes: The number of Facebook likes for the movie’s director, indicating their social media popularity.

actor_3_facebook_likes: The number of Facebook likes for the third-billed actor in the movie’s cast, indicating their popularity.

actor_2_name: The name of the second-billed actor in the movie’s cast.

actor_1_facebook_likes: The number of Facebook likes for the first-billed actor in the movie’s cast.

genres: The genres or categories that the movie belongs to (e.g., “Action,” “Comedy,” “Drama,” etc.).

actor_1_name: The name of the first-billed actor in the movie’s cast.

movie_title: The title of the movie.

num_voted_users: The number of users who voted or rated the movie, which can reflect its popularity.

cast_total_facebook_likes: The total number of Facebook likes for the movie’s entire cast.

actor_3_name: The name of the third-billed actor in the movie’s cast.

facenumber_in_poster: The number of faces on the movie poster, which may or may not be relevant to the movie’s success.

plot_keywords: Keywords or phrases describing the movie’s plot, themes, or content.

movie_imdb_link: A link to the movie’s IMDb page for additional information.

num_user_for_reviews: The number of user reviews for the movie, which can provide insight into its audience reception.

language: The primary language in which the movie is spoken or produced.

country: The country of origin for the movie.

content_rating: The content rating assigned to the movie, such as “PG-13,” “R,” “G,” etc.

budget: The total amount of money spent on making the movie.

title_year: The year in which the movie was released.

actor_2_facebook_likes: The number of Facebook likes for the second-billed actor in the movie’s cast.

imdb_score: The IMDb rating score reflecting the movie’s overall quality as rated by users.

aspect_ratio: The width-to-height ratio in which the movie is displayed (e.g., 16:9, 2.35:1).

movie_facebook_likes: The number of Facebook likes for the movie’s official Facebook page.

gross: The total gross revenue generated by the movie, indicating its financial success.

Importing relevant libraries

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(rvest)
## Warning: package 'rvest' was built under R version 4.1.3
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.3
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.1.3
library(ggplot2)
library(stringr)
## Warning: package 'stringr' was built under R version 4.1.3
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.3
## corrplot 0.92 loaded

Reading the CSV file containing IMDb movie metadata into a data frame named IMDB

IMDB <- read.csv("C:\\Users\\chinmay.Jain\\Desktop\\R\\movie.csv")
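One thing worth checking when reading this file is how blank fields are treated: by default read.csv keeps an empty text field as "" rather than NA, so blanks in character columns are not counted by is.na(). A minimal sketch, using a small stand-in CSV written to a temporary file (the file and the object name movies are hypothetical, not the real dataset):

```r
# Write a tiny stand-in CSV to a temporary file (not the real movie.csv)
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "color,director_name,duration",
  "Color,James Cameron,178",
  ",Doug Walker,"
), tmp)

# Treat empty fields as missing in every column, including character ones
movies <- read.csv(tmp, na.strings = c("", "NA"), stringsAsFactors = FALSE)

# Count missing values per column
colSums(is.na(movies))
```

Whether blanks should count as missing is a judgment call for the analysis; the sketch only shows how to make them visible.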

Summary statistics for the IMDB data frame

summary(IMDB)
##     color           director_name      num_critic_for_reviews    duration    
##  Length:5043        Length:5043        Min.   :  1.0          Min.   :  7.0  
##  Class :character   Class :character   1st Qu.: 50.0          1st Qu.: 93.0  
##  Mode  :character   Mode  :character   Median :110.0          Median :103.0  
##                                        Mean   :140.2          Mean   :107.2  
##                                        3rd Qu.:195.0          3rd Qu.:118.0  
##                                        Max.   :813.0          Max.   :511.0  
##                                        NA's   :50             NA's   :15     
##  director_facebook_likes actor_3_facebook_likes actor_2_name      
##  Min.   :    0.0         Min.   :    0.0        Length:5043       
##  1st Qu.:    7.0         1st Qu.:  133.0        Class :character  
##  Median :   49.0         Median :  371.5        Mode  :character  
##  Mean   :  686.5         Mean   :  645.0                          
##  3rd Qu.:  194.5         3rd Qu.:  636.0                          
##  Max.   :23000.0         Max.   :23000.0                          
##  NA's   :104             NA's   :23                               
##  actor_1_facebook_likes     gross              genres         
##  Min.   :     0         Min.   :      162   Length:5043       
##  1st Qu.:   614         1st Qu.:  5340988   Class :character  
##  Median :   988         Median : 25517500   Mode  :character  
##  Mean   :  6560         Mean   : 48468408                     
##  3rd Qu.: 11000         3rd Qu.: 62309438                     
##  Max.   :640000         Max.   :760505847                     
##  NA's   :7              NA's   :884                           
##  actor_1_name       movie_title        num_voted_users  
##  Length:5043        Length:5043        Min.   :      5  
##  Class :character   Class :character   1st Qu.:   8594  
##  Mode  :character   Mode  :character   Median :  34359  
##                                        Mean   :  83668  
##                                        3rd Qu.:  96309  
##                                        Max.   :1689764  
##                                                         
##  cast_total_facebook_likes actor_3_name       facenumber_in_poster
##  Min.   :     0            Length:5043        Min.   : 0.000      
##  1st Qu.:  1411            Class :character   1st Qu.: 0.000      
##  Median :  3090            Mode  :character   Median : 1.000      
##  Mean   :  9699                               Mean   : 1.371      
##  3rd Qu.: 13756                               3rd Qu.: 2.000      
##  Max.   :656730                               Max.   :43.000      
##                                               NA's   :13          
##  plot_keywords      movie_imdb_link    num_user_for_reviews   language        
##  Length:5043        Length:5043        Min.   :   1.0       Length:5043       
##  Class :character   Class :character   1st Qu.:  65.0       Class :character  
##  Mode  :character   Mode  :character   Median : 156.0       Mode  :character  
##                                        Mean   : 272.8                         
##                                        3rd Qu.: 326.0                         
##                                        Max.   :5060.0                         
##                                        NA's   :21                             
##    country          content_rating         budget            title_year  
##  Length:5043        Length:5043        Min.   :2.180e+02   Min.   :1916  
##  Class :character   Class :character   1st Qu.:6.000e+06   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :2.000e+07   Median :2005  
##                                        Mean   :3.975e+07   Mean   :2002  
##                                        3rd Qu.:4.500e+07   3rd Qu.:2011  
##                                        Max.   :1.222e+10   Max.   :2016  
##                                        NA's   :492         NA's   :108   
##  actor_2_facebook_likes   imdb_score     aspect_ratio   movie_facebook_likes
##  Min.   :     0         Min.   :1.600   Min.   : 1.18   Min.   :     0      
##  1st Qu.:   281         1st Qu.:5.800   1st Qu.: 1.85   1st Qu.:     0      
##  Median :   595         Median :6.600   Median : 2.35   Median :   166      
##  Mean   :  1652         Mean   :6.442   Mean   : 2.22   Mean   :  7526      
##  3rd Qu.:   918         3rd Qu.:7.200   3rd Qu.: 2.35   3rd Qu.:  3000      
##  Max.   :137000         Max.   :9.500   Max.   :16.00   Max.   :349000      
##  NA's   :13                             NA's   :329

Viewing the top 5 rows of the dataset

head(IMDB, 5)
##   color     director_name num_critic_for_reviews duration
## 1 Color     James Cameron                    723      178
## 2 Color    Gore Verbinski                    302      169
## 3 Color        Sam Mendes                    602      148
## 4 Color Christopher Nolan                    813      164
## 5             Doug Walker                     NA       NA
##   director_facebook_likes actor_3_facebook_likes     actor_2_name
## 1                       0                    855 Joel David Moore
## 2                     563                   1000    Orlando Bloom
## 3                       0                    161     Rory Kinnear
## 4                   22000                  23000   Christian Bale
## 5                     131                     NA       Rob Walker
##   actor_1_facebook_likes     gross                          genres
## 1                   1000 760505847 Action|Adventure|Fantasy|Sci-Fi
## 2                  40000 309404152        Action|Adventure|Fantasy
## 3                  11000 200074175       Action|Adventure|Thriller
## 4                  27000 448130642                 Action|Thriller
## 5                    131        NA                     Documentary
##      actor_1_name                                              movie_title
## 1     CCH Pounder                                                 Avatar 
## 2     Johnny Depp               Pirates of the Caribbean: At World's End 
## 3 Christoph Waltz                                                Spectre 
## 4       Tom Hardy                                  The Dark Knight Rises 
## 5     Doug Walker Star Wars: Episode VII - The Force Awakens             
##   num_voted_users cast_total_facebook_likes         actor_3_name
## 1          886204                      4834            Wes Studi
## 2          471220                     48350       Jack Davenport
## 3          275868                     11700     Stephanie Sigman
## 4         1144337                    106759 Joseph Gordon-Levitt
## 5               8                       143                     
##   facenumber_in_poster
## 1                    0
## 2                    0
## 3                    1
## 4                    0
## 5                    0
##                                                      plot_keywords
## 1                           avatar|future|marine|native|paraplegic
## 2     goddess|marriage ceremony|marriage proposal|pirate|singapore
## 3                              bomb|espionage|sequel|spy|terrorist
## 4 deception|imprisonment|lawlessness|police officer|terrorist plot
## 5                                                                 
##                                        movie_imdb_link num_user_for_reviews
## 1 http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1                 3054
## 2 http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1                 1238
## 3 http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1                  994
## 4 http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1                 2701
## 5 http://www.imdb.com/title/tt5289954/?ref_=fn_tt_tt_1                   NA
##   language country content_rating   budget title_year actor_2_facebook_likes
## 1  English     USA          PG-13 2.37e+08       2009                    936
## 2  English     USA          PG-13 3.00e+08       2007                   5000
## 3  English      UK          PG-13 2.45e+08       2015                    393
## 4  English     USA          PG-13 2.50e+08       2012                  23000
## 5                                       NA         NA                     12
##   imdb_score aspect_ratio movie_facebook_likes
## 1        7.9         1.78                33000
## 2        7.1         2.35                    0
## 3        6.8         2.35                85000
## 4        8.5         2.35               164000
## 5        7.1           NA                    0

Viewing dimensions of the dataset

dim(IMDB)
## [1] 5043   28

Getting the data types of each column in the dataset

str(IMDB)
## 'data.frame':    5043 obs. of  28 variables:
##  $ color                    : chr  "Color" "Color" "Color" "Color" ...
##  $ director_name            : chr  "James Cameron" "Gore Verbinski" "Sam Mendes" "Christopher Nolan" ...
##  $ num_critic_for_reviews   : int  723 302 602 813 NA 462 392 324 635 375 ...
##  $ duration                 : int  178 169 148 164 NA 132 156 100 141 153 ...
##  $ director_facebook_likes  : int  0 563 0 22000 131 475 0 15 0 282 ...
##  $ actor_3_facebook_likes   : int  855 1000 161 23000 NA 530 4000 284 19000 10000 ...
##  $ actor_2_name             : chr  "Joel David Moore" "Orlando Bloom" "Rory Kinnear" "Christian Bale" ...
##  $ actor_1_facebook_likes   : int  1000 40000 11000 27000 131 640 24000 799 26000 25000 ...
##  $ gross                    : int  760505847 309404152 200074175 448130642 NA 73058679 336530303 200807262 458991599 301956980 ...
##  $ genres                   : chr  "Action|Adventure|Fantasy|Sci-Fi" "Action|Adventure|Fantasy" "Action|Adventure|Thriller" "Action|Thriller" ...
##  $ actor_1_name             : chr  "CCH Pounder" "Johnny Depp" "Christoph Waltz" "Tom Hardy" ...
##  $ movie_title              : chr  "Avatar " "Pirates of the Caribbean: At World's End " "Spectre " "The Dark Knight Rises " ...
##  $ num_voted_users          : int  886204 471220 275868 1144337 8 212204 383056 294810 462669 321795 ...
##  $ cast_total_facebook_likes: int  4834 48350 11700 106759 143 1873 46055 2036 92000 58753 ...
##  $ actor_3_name             : chr  "Wes Studi" "Jack Davenport" "Stephanie Sigman" "Joseph Gordon-Levitt" ...
##  $ facenumber_in_poster     : int  0 0 1 0 0 1 0 1 4 3 ...
##  $ plot_keywords            : chr  "avatar|future|marine|native|paraplegic" "goddess|marriage ceremony|marriage proposal|pirate|singapore" "bomb|espionage|sequel|spy|terrorist" "deception|imprisonment|lawlessness|police officer|terrorist plot" ...
##  $ movie_imdb_link          : chr  "http://www.imdb.com/title/tt0499549/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt0449088/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt2379713/?ref_=fn_tt_tt_1" "http://www.imdb.com/title/tt1345836/?ref_=fn_tt_tt_1" ...
##  $ num_user_for_reviews     : int  3054 1238 994 2701 NA 738 1902 387 1117 973 ...
##  $ language                 : chr  "English" "English" "English" "English" ...
##  $ country                  : chr  "USA" "USA" "UK" "USA" ...
##  $ content_rating           : chr  "PG-13" "PG-13" "PG-13" "PG-13" ...
##  $ budget                   : num  2.37e+08 3.00e+08 2.45e+08 2.50e+08 NA ...
##  $ title_year               : int  2009 2007 2015 2012 NA 2012 2007 2010 2015 2009 ...
##  $ actor_2_facebook_likes   : int  936 5000 393 23000 12 632 11000 553 21000 11000 ...
##  $ imdb_score               : num  7.9 7.1 6.8 8.5 7.1 6.6 6.2 7.8 7.5 7.5 ...
##  $ aspect_ratio             : num  1.78 2.35 2.35 2.35 NA 2.35 2.35 1.85 2.35 2.35 ...
##  $ movie_facebook_likes     : int  33000 0 85000 164000 0 24000 0 29000 118000 10000 ...

Count the number of duplicated rows in the dataset

sum(duplicated(IMDB))
## [1] 45

Remove duplicate rows from the dataset

IMDB <- unique(IMDB)
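A quick sanity check after de-duplication is that the row count drops by exactly the number of duplicates found; for this dataset, 5043 - 45 = 4998 rows are expected. A sketch of that check on a toy data frame (the names toy, n_dupes and deduped are assumptions):

```r
library(dplyr)

# Toy data with one fully duplicated row
toy <- data.frame(
  movie_title = c("Avatar", "Spectre", "Avatar"),
  imdb_score  = c(7.9, 6.8, 7.9)
)

n_dupes <- sum(duplicated(toy))  # rows that repeat an earlier row
deduped <- distinct(toy)         # dplyr equivalent of unique() for data frames

# The row count should fall by exactly the number of duplicates
nrow(toy) - nrow(deduped) == n_dupes
```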

Print the first 5 movie titles

head(IMDB$movie_title, 5)
## [1] "Avatar "                                                
## [2] "Pirates of the Caribbean: At World's End "              
## [3] "Spectre "                                               
## [4] "The Dark Knight Rises "                                 
## [5] "Star Wars: Episode VII - The Force Awakens             "

Remove the special character “Â” from the movie_title column by replacing it with an empty string, then trim whitespace from the right end of each title

IMDB$movie_title <- str_replace_all(IMDB$movie_title, "Â", "") 

IMDB$movie_title <- str_trim(IMDB$movie_title, side = "right")  
head(IMDB$movie_title)
## [1] "Avatar"                                    
## [2] "Pirates of the Caribbean: At World's End"  
## [3] "Spectre"                                   
## [4] "The Dark Knight Rises"                     
## [5] "Star Wars: Episode VII - The Force Awakens"
## [6] "John Carter"
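The same cleaning step can be tried on a small vector. This sketch assumes the stray character is the usual residue of a UTF-8 non-breaking space read as Latin-1, that is, "Â" followed by a non-breaking space; that is an assumption about the raw file, not something the output above confirms. It removes both characters explicitly and then trims ordinary whitespace.

```r
library(stringr)

# Toy titles with an assumed trailing "Â" plus non-breaking space
titles <- c("Avatar\u00c2\u00a0", "Spectre \u00c2\u00a0")

# Drop both stray characters, then trim remaining whitespace on both ends
clean_titles <- str_trim(str_replace_all(titles, "[\u00c2\u00a0]", ""))

clean_titles
```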

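Separating rows in the ‘genres’ column based on the ‘|’ separator, creating a new column ‘genre_indicator’ with value 1, and spreading the genres into one indicator column per genre (1 if the movie belongs to the genre, 0 otherwise)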

IMDB <- IMDB %>%
  separate_rows(genres, sep = "\\|") %>%
  mutate(genre_indicator = 1) %>%
  spread(genres, genre_indicator, fill = 0)
IMDB
## # A tibble: 4,998 x 53
##    color   director_name  num_critic_for_reviews duration director_facebook_li~1
##    <chr>   <chr>                           <int>    <int>                  <int>
##  1 "Color" James Cameron                     723      178                      0
##  2 "Color" Gore Verbinski                    302      169                    563
##  3 "Color" Sam Mendes                        602      148                      0
##  4 "Color" Christopher N~                    813      164                  22000
##  5 ""      Doug Walker                        NA       NA                    131
##  6 "Color" Andrew Stanton                    462      132                    475
##  7 "Color" Sam Raimi                         392      156                      0
##  8 "Color" Nathan Greno                      324      100                     15
##  9 "Color" Joss Whedon                       635      141                      0
## 10 "Color" David Yates                       375      153                    282
## # i 4,988 more rows
## # i abbreviated name: 1: director_facebook_likes
## # i 48 more variables: actor_3_facebook_likes <int>, actor_2_name <chr>,
## #   actor_1_facebook_likes <int>, gross <int>, actor_1_name <chr>,
## #   movie_title <chr>, num_voted_users <int>, cast_total_facebook_likes <int>,
## #   actor_3_name <chr>, facenumber_in_poster <int>, plot_keywords <chr>,
## #   movie_imdb_link <chr>, num_user_for_reviews <int>, language <chr>, ...
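The tidyr documentation marks spread() as superseded by pivot_wider(). A minimal sketch of the same genre-indicator step written with pivot_wider(), on a toy tibble (the names toy and toy_wide are assumptions):

```r
library(dplyr)
library(tidyr)

# Toy data: two movies with pipe-separated genres
toy <- tibble(
  movie_title = c("A", "B"),
  genres      = c("Action|Drama", "Drama")
)

toy_wide <- toy %>%
  separate_rows(genres, sep = "\\|") %>%   # one row per movie-genre pair
  mutate(genre_indicator = 1) %>%          # mark the pair as present
  pivot_wider(
    names_from  = genres,                  # one column per genre
    values_from = genre_indicator,
    values_fill = 0                        # absent genres become 0
  )

toy_wide
```

Unlike spread(), which orders the new columns alphabetically for character keys, pivot_wider() orders them by first appearance in the data.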

Counting the number of missing (NA) values in each column

colSums(is.na(IMDB))
##                     color             director_name    num_critic_for_reviews 
##                         0                         0                        49 
##                  duration   director_facebook_likes    actor_3_facebook_likes 
##                        15                       103                        23 
##              actor_2_name    actor_1_facebook_likes                     gross 
##                         0                         7                       874 
##              actor_1_name               movie_title           num_voted_users 
##                         0                         0                         0 
## cast_total_facebook_likes              actor_3_name      facenumber_in_poster 
##                         0                         0                        13 
##             plot_keywords           movie_imdb_link      num_user_for_reviews 
##                         0                         0                        21 
##                  language                   country            content_rating 
##                         0                         0                         0 
##                    budget                title_year    actor_2_facebook_likes 
##                       487                       107                        13 
##                imdb_score              aspect_ratio      movie_facebook_likes 
##                         0                       327                         0 
##                    Action                 Adventure                 Animation 
##                         0                         0                         0 
##                 Biography                    Comedy                     Crime 
##                         0                         0                         0 
##               Documentary                     Drama                    Family 
##                         0                         0                         0 
##                   Fantasy                 Film-Noir                 Game-Show 
##                         0                         0                         0 
##                   History                    Horror                     Music 
##                         0                         0                         0 
##                   Musical                   Mystery                      News 
##                         0                         0                         0 
##                Reality-TV                   Romance                    Sci-Fi 
##                         0                         0                         0 
##                     Short                     Sport                  Thriller 
##                         0                         0                         0 
##                       War                   Western 
##                         0                         0

Calculating the percentage of missing (NA) values in each column

null_percentages <- colMeans(is.na(IMDB)) * 100

print(null_percentages)
##                     color             director_name    num_critic_for_reviews 
##                 0.0000000                 0.0000000                 0.9803922 
##                  duration   director_facebook_likes    actor_3_facebook_likes 
##                 0.3001200                 2.0608243                 0.4601841 
##              actor_2_name    actor_1_facebook_likes                     gross 
##                 0.0000000                 0.1400560                17.4869948 
##              actor_1_name               movie_title           num_voted_users 
##                 0.0000000                 0.0000000                 0.0000000 
## cast_total_facebook_likes              actor_3_name      facenumber_in_poster 
##                 0.0000000                 0.0000000                 0.2601040 
##             plot_keywords           movie_imdb_link      num_user_for_reviews 
##                 0.0000000                 0.0000000                 0.4201681 
##                  language                   country            content_rating 
##                 0.0000000                 0.0000000                 0.0000000 
##                    budget                title_year    actor_2_facebook_likes 
##                 9.7438976                 2.1408563                 0.2601040 
##                imdb_score              aspect_ratio      movie_facebook_likes 
##                 0.0000000                 6.5426170                 0.0000000 
##                    Action                 Adventure                 Animation 
##                 0.0000000                 0.0000000                 0.0000000 
##                 Biography                    Comedy                     Crime 
##                 0.0000000                 0.0000000                 0.0000000 
##               Documentary                     Drama                    Family 
##                 0.0000000                 0.0000000                 0.0000000 
##                   Fantasy                 Film-Noir                 Game-Show 
##                 0.0000000                 0.0000000                 0.0000000 
##                   History                    Horror                     Music 
##                 0.0000000                 0.0000000                 0.0000000 
##                   Musical                   Mystery                      News 
##                 0.0000000                 0.0000000                 0.0000000 
##                Reality-TV                   Romance                    Sci-Fi 
##                 0.0000000                 0.0000000                 0.0000000 
##                     Short                     Sport                  Thriller 
##                 0.0000000                 0.0000000                 0.0000000 
##                       War                   Western 
##                 0.0000000                 0.0000000
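With more than fifty columns, the full printout is hard to scan. A sketch that keeps only the columns with missing values and sorts them from most to least missing, shown on toy data (the names toy and missing_pct are assumptions):

```r
# Toy data with missing values in two of three columns
toy <- data.frame(
  gross      = c(100, NA, NA, 400),
  budget     = c(50, 60, NA, 80),
  imdb_score = c(7.9, 7.1, 6.8, 8.5)
)

# Percentage of NA values per column
missing_pct <- colMeans(is.na(toy)) * 100

# Keep only columns with at least one NA, largest share first
missing_pct <- sort(missing_pct[missing_pct > 0], decreasing = TRUE)

missing_pct
```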

Most columns have fewer than 10% missing values; gross is the exception at about 17.5%. Dropping rows with any NA values from the IMDB dataset

IMDB <- na.omit(IMDB)
colSums(is.na(IMDB))
##                     color             director_name    num_critic_for_reviews 
##                         0                         0                         0 
##                  duration   director_facebook_likes    actor_3_facebook_likes 
##                         0                         0                         0 
##              actor_2_name    actor_1_facebook_likes                     gross 
##                         0                         0                         0 
##              actor_1_name               movie_title           num_voted_users 
##                         0                         0                         0 
## cast_total_facebook_likes              actor_3_name      facenumber_in_poster 
##                         0                         0                         0 
##             plot_keywords           movie_imdb_link      num_user_for_reviews 
##                         0                         0                         0 
##                  language                   country            content_rating 
##                         0                         0                         0 
##                    budget                title_year    actor_2_facebook_likes 
##                         0                         0                         0 
##                imdb_score              aspect_ratio      movie_facebook_likes 
##                         0                         0                         0 
##                    Action                 Adventure                 Animation 
##                         0                         0                         0 
##                 Biography                    Comedy                     Crime 
##                         0                         0                         0 
##               Documentary                     Drama                    Family 
##                         0                         0                         0 
##                   Fantasy                 Film-Noir                 Game-Show 
##                         0                         0                         0 
##                   History                    Horror                     Music 
##                         0                         0                         0 
##                   Musical                   Mystery                      News 
##                         0                         0                         0 
##                Reality-TV                   Romance                    Sci-Fi 
##                         0                         0                         0 
##                     Short                     Sport                  Thriller 
##                         0                         0                         0 
##                       War                   Western 
##                         0                         0
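na.omit() drops a row if any column is missing, including columns that a given analysis may never use. One possible alternative, sketched here on toy data, is tidyr's drop_na() restricted to the columns actually needed; the choice of gross and budget is only an example, and the object names are assumptions.

```r
library(tidyr)

# Toy data: each row is missing a different column
toy <- data.frame(
  movie_title  = c("A", "B", "C"),
  gross        = c(100, NA, 300),
  budget       = c(50, 60, 70),
  aspect_ratio = c(NA, 2.35, 1.85)
)

all_complete <- na.omit(toy)                 # keeps only fully complete rows
needed_only  <- drop_na(toy, gross, budget)  # ignores NAs in other columns

c(na_omit = nrow(all_complete), drop_na = nrow(needed_only))
```

The trade-off is that later steps must then tolerate NA values in the columns that were not checked.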

Identifying the numeric and categorical columns

num_vars <- IMDB %>%
  select_if(is.numeric) %>%
  colnames()

cat_vars <- setdiff(names(IMDB), num_vars)

UNIVARIATE ANALYSIS OF THE NUMERICAL COLUMNS

Performing univariate analysis for numerical columns

for (column in num_vars) {
  # Create histogram plot
  hist_data <- IMDB[[column]]
  hist(hist_data, main = paste("Univariate Analysis of", column), xlab = column, col = "skyblue", border = "black")
}

UNIVARIATE ANALYSIS OF THE CATEGORICAL COLUMNS

Creating Bar graph for ‘COLOR’

bar_color <- ggplot(IMDB, aes(x = color)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Bar graph of Color") +
  theme_minimal()
bar_color

Creating Bar graph for ‘director_name’

bar_director_name <- ggplot(IMDB, aes(x = director_name)) +
  geom_bar(fill = "skyblue", color = "white") +
  labs(title = "Bar graph of Director Name") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
bar_director_name
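A column such as director_name has so many distinct values that a bar per value is likely to be unreadable. One common alternative is to plot only the most frequent values. A sketch on toy data, keeping the top 2 directors for brevity (the cutoff and the names toy, top_directors and bar_top_directors are assumptions):

```r
library(dplyr)
library(ggplot2)

# Toy data: one row per movie
toy <- data.frame(director_name = c("A", "A", "A", "B", "B", "C"))

# Count movies per director and keep the most frequent ones
top_directors <- toy %>%
  count(director_name, sort = TRUE) %>%
  slice_head(n = 2)

# Horizontal bars, ordered by count, keep long names readable
bar_top_directors <- ggplot(top_directors,
                            aes(x = reorder(director_name, n), y = n)) +
  geom_col(fill = "skyblue", color = "black") +
  coord_flip() +
  labs(title = "Directors with the most movies", x = "director_name", y = "count") +
  theme_minimal()
bar_top_directors
```

The same pattern would apply to actor_2_name and to language or country.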

Creating Bar graph for ‘content_rating’

bar_content_rating <- ggplot(IMDB, aes(x = content_rating)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Bar graph of Content Rating") +
  theme_minimal()
bar_content_rating

Creating bar graph for ‘actor_2_name’

bar_actor_2_name <- ggplot(IMDB, aes(x = actor_2_name)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Bar graph of Actor 2 Name") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  
bar_actor_2_name

Creating bar chart for ‘movie_title’

bar_movie_title <- ggplot(IMDB, aes(x = movie_title)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Bar graph of Movie Title") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 
bar_movie_title

Creating Bar graph for ‘language’

bar_language <- ggplot(IMDB, aes(x = language)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Bar graph of Language") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) 
bar_language

Creating Bar graph for ‘country’

bar_country <- ggplot(IMDB, aes(x = country)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Bar graph of Country") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
bar_country

BIVARIATE ANALYSIS

# Scatter plot for 'color' vs 'num_critic_for_reviews'
scatter_color_num_critic <- ggplot(IMDB, aes(x = color, y = num_critic_for_reviews, color = color)) +
  geom_point() +
  labs(title = "Scatter Plot: Color vs Num Critic for Reviews") +
  theme_minimal()

scatter_color_num_critic

# Scatter plot for 'content_rating' vs 'num_critic_for_reviews'
scatter_content_rating_num_critic <- ggplot(IMDB, aes(x = content_rating, y = num_critic_for_reviews, color = content_rating)) +
  geom_point() +
  labs(title = "Scatter Plot: Content Rating vs Num Critic for Reviews") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for better readability

scatter_content_rating_num_critic

# Scatter plot for 'color' vs 'duration'
scatter_color_duration <- ggplot(IMDB, aes(x = color, y = duration, color = color)) +
  geom_point() +
  labs(title = "Scatter Plot: Color vs Duration") +
  theme_minimal()

scatter_color_duration

# Scatter plot for 'content_rating' vs 'duration'
scatter_content_rating_duration <- ggplot(IMDB, aes(x = content_rating, y = duration, color = content_rating)) +
  geom_point() +
  labs(title = "Scatter Plot: Content Rating vs Duration") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for better readability

scatter_content_rating_duration

# Scatter plot for 'color' vs 'director_facebook_likes'
scatter_color_director_facebook <- ggplot(IMDB, aes(x = color, y = director_facebook_likes, color = color)) +
  geom_point() +
  labs(title = "Scatter Plot: Color vs Director Facebook Likes") +
  theme_minimal()

scatter_color_director_facebook

# Scatter plot for 'content_rating' vs 'director_facebook_likes'
scatter_content_rating_director_facebook <- ggplot(IMDB, aes(x = content_rating, y = director_facebook_likes, color = content_rating)) +
  geom_point() +
  labs(title = "Scatter Plot: Content Rating vs Director Facebook Likes") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for better readability

scatter_content_rating_director_facebook

# Scatter plot for 'color' vs 'actor_3_facebook_likes'
scatter_color_actor3_facebook_likes <- ggplot(IMDB, aes(x = color, y = actor_3_facebook_likes, color = color)) +
  geom_point() +
  labs(title = "Scatter Plot: Color vs Actor 3 Facebook Likes") +
  theme_minimal()

scatter_color_actor3_facebook_likes

# Scatter plot for 'content_rating' vs 'actor_3_facebook_likes'
scatter_content_rating_actor3_facebook_likes <- ggplot(IMDB, aes(x = content_rating, y = actor_3_facebook_likes, color = content_rating)) +
  geom_point() +
  labs(title = "Scatter Plot: Content Rating vs Actor 3 Facebook Likes") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels for better readability

scatter_content_rating_actor3_facebook_likes

# Scatter plot for 'color' vs 'actor_1_facebook_likes'
scatter_color_actor1_facebook_likes <- ggplot(IMDB, aes(x = color, y = actor_1_facebook_likes, color = color)) +
  geom_point() +
  labs(title = "Scatter Plot: Color vs Actor 1 Facebook Likes") +
  theme_minimal()

scatter_color_actor1_facebook_likes

# Scatter plot for 'color' vs 'num_voted_users'
scatter_color_num_voted_users <- ggplot(IMDB, aes(x = color, y = num_voted_users, color = color)) +
  geom_point() +
  labs(title = "Scatter Plot: Color vs Num Voted Users") +
  theme_minimal()

scatter_color_num_voted_users

# Scatter plot for 'content_rating' vs 'num_voted_users'
scatter_content_rating_num_voted_users <- ggplot(IMDB, aes(x = content_rating, y = num_voted_users, color = content_rating)) +
  geom_point() +
  labs(title = "Scatter Plot: Content Rating vs Num Voted Users") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

scatter_content_rating_num_voted_users

# Scatter plot for 'color' vs 'facenumber_in_poster'
scatter_color_facenumber_in_poster <- ggplot(IMDB, aes(x = color, y = facenumber_in_poster, color = color)) +
  geom_point() +
  labs(title = "Scatter Plot: Color vs Face Number in Poster") +
  theme_minimal()

scatter_color_facenumber_in_poster

# Scatter plot for 'content_rating' vs 'facenumber_in_poster'
scatter_content_rating_facenumber_in_poster <- ggplot(IMDB, aes(x = content_rating, y = facenumber_in_poster, color = content_rating)) +
  geom_point() +
  labs(title = "Scatter Plot: Content Rating vs Face Number in Poster") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

scatter_content_rating_facenumber_in_poster

# Scatter plot for 'color' vs 'num_user_for_reviews'
scatter_color_num_user_for_reviews <- ggplot(IMDB, aes(x = color, y = num_user_for_reviews, color = color)) +
  geom_point() +
  labs(title = "Scatter Plot: Color vs Num User for Reviews") +
  theme_minimal()

scatter_color_num_user_for_reviews

# Scatter plot for 'content_rating' vs 'num_user_for_reviews'
scatter_content_rating_num_user_for_reviews <- ggplot(IMDB, aes(x = content_rating, y = num_user_for_reviews, color = content_rating)) +
  geom_point() +
  labs(title = "Scatter Plot: Content Rating vs Num User for Reviews") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

scatter_content_rating_num_user_for_reviews

# Scatter plot for 'color' vs 'budget'
scatter_color_budget <- ggplot(IMDB, aes(x = color, y = budget, color = color)) +
  geom_point() +
  labs(title = "Scatter Plot: Color vs Budget") +
  theme_minimal()

scatter_color_budget

# Scatter plot for 'content_rating' vs 'budget'
scatter_content_rating_budget <- ggplot(IMDB, aes(x = content_rating, y = budget, color = content_rating)) +
  geom_point() +
  labs(title = "Scatter Plot: Content Rating vs Budget") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

scatter_content_rating_budget

# Scatter plot for 'color' vs 'title_year'
scatter_color_title_year <- ggplot(IMDB, aes(x = color, y = title_year, color = color)) +
  geom_point() +
  labs(title = "Scatter Plot: Color vs Title Year") +
  theme_minimal()

scatter_color_title_year

# Scatter plot for 'content_rating' vs 'title_year'
scatter_content_rating_title_year <- ggplot(IMDB, aes(x = content_rating, y = title_year, color = content_rating)) +
  geom_point() +
  labs(title = "Scatter Plot: Content Rating vs Title Year") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

scatter_content_rating_title_year

# Scatter plot for 'color' vs 'imdb_score'
scatter_color_imdb_score <- ggplot(IMDB, aes(x = color, y = imdb_score, color = color)) +
  geom_point() +
  labs(title = "Scatter Plot: Color vs IMDB Score") +
  theme_minimal()

scatter_color_imdb_score

# Scatter plot for 'content_rating' vs 'imdb_score'
scatter_content_rating_imdb_score <- ggplot(IMDB, aes(x = content_rating, y = imdb_score, color = content_rating)) +
  geom_point() +
  labs(title = "Scatter Plot: Content Rating vs IMDB Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

scatter_content_rating_imdb_score

# Scatter plot for 'color' vs 'aspect_ratio'
scatter_color_aspect_ratio <- ggplot(IMDB, aes(x = color, y = aspect_ratio, color = color)) +
  geom_point() +
  labs(title = "Scatter Plot: Color vs Aspect Ratio") +
  theme_minimal()

scatter_color_aspect_ratio

# Scatter plot for 'content_rating' vs 'aspect_ratio'
scatter_content_rating_aspect_ratio <- ggplot(IMDB, aes(x = content_rating, y = aspect_ratio, color = content_rating)) +
  geom_point() +
  labs(title = "Scatter Plot: Content Rating vs Aspect Ratio") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

scatter_content_rating_aspect_ratio

# Scatter plot for 'color' vs 'movie_facebook_likes'
scatter_color_movie_facebook_likes <- ggplot(IMDB, aes(x = color, y = movie_facebook_likes, color = color)) +
  geom_point() +
  labs(title = "Scatter Plot: Color vs Movie Facebook Likes") +
  theme_minimal()

scatter_color_movie_facebook_likes

# Scatter plot for 'content_rating' vs 'movie_facebook_likes'
scatter_content_rating_movie_facebook_likes <- ggplot(IMDB, aes(x = content_rating, y = movie_facebook_likes, color = content_rating)) +
  geom_point() +
  labs(title = "Scatter Plot: Content Rating vs Movie Facebook Likes") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

scatter_content_rating_movie_facebook_likes

# Scatter plot for 'color' vs 'gross'
scatter_color_gross <- ggplot(IMDB, aes(x = color, y = gross, color = color)) +
  geom_point() +
  labs(title = "Scatter Plot: Color vs Gross") +
  theme_minimal()

scatter_color_gross

# Scatter plot for 'content_rating' vs 'gross'
scatter_content_rating_gross <- ggplot(IMDB, aes(x = content_rating, y = gross, color = content_rating)) +
  geom_point() +
  labs(title = "Scatter Plot: Content Rating vs Gross") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

scatter_content_rating_gross

scatter_color_num_critic
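Because color and content_rating are categorical, every plot above places its points in a few vertical strips, which makes the groups hard to compare. A per-group summary may complement the plots. The following is a minimal sketch on toy data; the data frame `movies_toy` and its values are assumptions made for illustration.

```r
# Toy data standing in for IMDB (illustrative values only)
movies_toy <- data.frame(
  content_rating = c("PG", "PG", "R", "R", "R"),
  duration       = c(90, 100, 120, 110, 130)
)

# Median of the numeric column within each category
median_by_group <- tapply(movies_toy$duration, movies_toy$content_rating, median)
median_by_group

# With ggplot2, using geom_boxplot() in place of geom_point() gives the
# graphical equivalent of this comparison.
```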

Dropping Columns

col_to_drop <- c('color', 'duration', 'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name', 'actor_1_facebook_likes', 'genres', 'actor_3_name', 'plot_keywords', 'movie_imdb_link', 'language', 'country', 'actor_2_facebook_likes', 'aspect_ratio')

Dropping the irrelevant columns and printing the head of the dataframe

IMDB1 <- IMDB[, !names(IMDB) %in% col_to_drop]
head(IMDB1, 5)
## # A tibble: 5 x 40
##   director_name     num_critic_for_reviews     gross actor_1_name    movie_title
##   <chr>                              <int>     <int> <chr>           <chr>      
## 1 James Cameron                        723 760505847 CCH Pounder     Avatar     
## 2 Gore Verbinski                       302 309404152 Johnny Depp     Pirates of~
## 3 Sam Mendes                           602 200074175 Christoph Waltz Spectre    
## 4 Christopher Nolan                    813 448130642 Tom Hardy       The Dark K~
## 5 Andrew Stanton                       462  73058679 Daryl Sabara    John Carter
## # i 35 more variables: num_voted_users <int>, cast_total_facebook_likes <int>,
## #   facenumber_in_poster <int>, num_user_for_reviews <int>,
## #   content_rating <chr>, budget <dbl>, title_year <int>, imdb_score <dbl>,
## #   movie_facebook_likes <int>, Action <dbl>, Adventure <dbl>, Animation <dbl>,
## #   Biography <dbl>, Comedy <dbl>, Crime <dbl>, Documentary <dbl>, Drama <dbl>,
## #   Family <dbl>, Fantasy <dbl>, `Film-Noir` <dbl>, `Game-Show` <dbl>,
## #   History <dbl>, Horror <dbl>, Music <dbl>, Musical <dbl>, Mystery <dbl>, ...

Reason for dropping columns:

The columns ‘color’, ‘duration’, ‘director_facebook_likes’, ‘actor_3_facebook_likes’, ‘actor_2_name’, ‘actor_1_facebook_likes’, ‘genres’, ‘actor_3_name’, ‘plot_keywords’, ‘movie_imdb_link’, ‘language’, ‘country’, ‘actor_2_facebook_likes’ and ‘aspect_ratio’ do not contain information that directly influences or correlates with the IMDb rating, making them irrelevant for IMDb rating analysis.

‘color’: The majority of the movies are in colour, so this column carries little meaningful information.

‘duration’: The movie’s duration might not serve as a significant predictor for IMDb ratings or box office performance.

‘director_facebook_likes’: While a director’s influence can affect a movie’s success, the count of their Facebook likes may not be the most relevant metric.

‘actor_3_facebook_likes’, ‘actor_2_name’, ‘actor_1_facebook_likes’, ‘actor_3_name’, ‘actor_2_facebook_likes’: The names of the supporting actors and the Facebook likes of individual actors may not strongly indicate a movie’s success or quality.

‘genres’: Although the ‘genres’ column is valuable for genre-related analysis, its information has already been extracted into binary columns (Action, Adventure, Animation, and so on), so the original column is redundant.

‘plot_keywords’: Plot keywords tend to be highly specific and exhibit considerable variation.

‘movie_imdb_link’: The URL of the movie’s IMDb page does not provide any useful information for the analysis.

‘language’: The language doesn’t show a considerable relation with the target variable.

‘country’: The country doesn’t show a considerable relation with the target variable.

‘title_year’: Unlike the columns above, title_year is not in col_to_drop. It is retained in IMDB1 and used for the release-year histogram and the correlation heatmap.

‘aspect_ratio’: The aspect ratio of movies doesn’t significantly impact IMDb ratings or box office success.

By dropping these columns, we can focus on exploring and analyzing the more relevant features that have a stronger correlation with the ‘imdb_score’ target variable.
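One way to check the claim that a column has little relation to the rating is to compute its correlation with imdb_score before dropping it. The following is a minimal sketch on toy data; the data frame `movies_toy` and its values are assumptions, and only numeric columns can be checked this way.

```r
# Toy data standing in for IMDB (illustrative values only)
movies_toy <- data.frame(
  imdb_score   = c(6.1, 7.4, 5.2, 8.0, 6.8),
  duration     = c(95, 130, 88, 150, 110),
  aspect_ratio = c(1.85, 2.35, 1.85, 2.35, 1.85),
  movie_title  = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# Keep the numeric columns only and correlate each one with imdb_score
numeric_cols   <- movies_toy[sapply(movies_toy, is.numeric)]
cor_with_score <- cor(numeric_cols, use = "pairwise.complete.obs")[, "imdb_score"]

# Drop the trivial self-correlation and sort from strongest positive downwards
cor_with_score <- cor_with_score[names(cor_with_score) != "imdb_score"]
cor_with_score <- sort(cor_with_score, decreasing = TRUE)
cor_with_score
```

Columns whose absolute correlation is close to zero are the candidates for dropping; categorical columns such as language or country would need a different check, for example a comparison of group means.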

Summary statistics for the new dataframe

summary(IMDB1)
##  director_name      num_critic_for_reviews     gross          
##  Length:3768        Min.   :  1.0          Min.   :      162  
##  Class :character   1st Qu.: 75.0          1st Qu.:  7571550  
##  Mode  :character   Median :137.0          Median : 29036498  
##                     Mean   :165.5          Mean   : 51869535  
##                     3rd Qu.:223.0          3rd Qu.: 66466858  
##                     Max.   :813.0          Max.   :760505847  
##  actor_1_name       movie_title        num_voted_users  
##  Length:3768        Length:3768        Min.   :      5  
##  Class :character   Class :character   1st Qu.:  18769  
##  Mode  :character   Mode  :character   Median :  53041  
##                                        Mean   : 104398  
##                                        3rd Qu.: 126909  
##                                        Max.   :1689764  
##  cast_total_facebook_likes facenumber_in_poster num_user_for_reviews
##  Min.   :     0            Min.   : 0.000       Min.   :   1.0      
##  1st Qu.:  1862            1st Qu.: 0.000       1st Qu.: 107.0      
##  Median :  3965            Median : 1.000       Median : 207.0      
##  Mean   : 11382            Mean   : 1.378       Mean   : 332.6      
##  3rd Qu.: 16122            3rd Qu.: 2.000       3rd Qu.: 395.2      
##  Max.   :656730            Max.   :43.000       Max.   :5060.0      
##  content_rating         budget            title_year     imdb_score   
##  Length:3768        Min.   :2.180e+02   Min.   :1920   Min.   :1.600  
##  Class :character   1st Qu.:1.000e+07   1st Qu.:1999   1st Qu.:5.900  
##  Mode  :character   Median :2.500e+07   Median :2005   Median :6.600  
##                     Mean   :4.585e+07   Mean   :2003   Mean   :6.466  
##                     3rd Qu.:5.000e+07   3rd Qu.:2010   3rd Qu.:7.200  
##                     Max.   :1.222e+10   Max.   :2016   Max.   :9.300  
##  movie_facebook_likes     Action         Adventure        Animation      
##  Min.   :     0       Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:     0       1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :   217       Median :0.0000   Median :0.0000   Median :0.00000  
##  Mean   :  9208       Mean   :0.2532   Mean   :0.2059   Mean   :0.05228  
##  3rd Qu.: 11000       3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :349000       Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##    Biography           Comedy           Crime         Documentary   
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :0.00000   Median :0.0000   Median :0.0000   Median :0.000  
##  Mean   :0.06396   Mean   :0.3893   Mean   :0.1887   Mean   :0.013  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :1.000  
##      Drama            Family          Fantasy         Film-Noir        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000000  
##  Median :1.0000   Median :0.0000   Median :0.0000   Median :0.0000000  
##  Mean   :0.5061   Mean   :0.1176   Mean   :0.1351   Mean   :0.0002654  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0.0000000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000000  
##    Game-Show    History            Horror           Music        
##  Min.   :0   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0   Median :0.00000   Median :0.0000   Median :0.00000  
##  Mean   :0   Mean   :0.03954   Mean   :0.1032   Mean   :0.04007  
##  3rd Qu.:0   3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :0   Max.   :1.00000   Max.   :1.0000   Max.   :1.00000  
##     Musical           Mystery            News     Reality-TV    Romance      
##  Min.   :0.00000   Min.   :0.0000   Min.   :0   Min.   :0    Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0   1st Qu.:0    1st Qu.:0.0000  
##  Median :0.00000   Median :0.0000   Median :0   Median :0    Median :0.0000  
##  Mean   :0.02654   Mean   :0.1008   Mean   :0   Mean   :0    Mean   :0.2282  
##  3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0   3rd Qu.:0    3rd Qu.:0.0000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :0   Max.   :0    Max.   :1.0000  
##      Sci-Fi           Short       Sport            Thriller     
##  Min.   :0.0000   Min.   :0   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.0000   Median :0   Median :0.00000   Median :0.0000  
##  Mean   :0.1311   Mean   :0   Mean   :0.03901   Mean   :0.2946  
##  3rd Qu.:0.0000   3rd Qu.:0   3rd Qu.:0.00000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :0   Max.   :1.00000   Max.   :1.0000  
##       War             Western       
##  Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000  
##  Mean   :0.04087   Mean   :0.01566  
##  3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000

The IMDb dataset contains widely varying values. Outlier treatment is not applied, because outliers in budget and IMDb score (for example, very high-budget movies) can themselves be a key focus area of the analysis.
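Outliers can still be flagged for inspection without removing them. The following is a minimal sketch of the common 1.5 × IQR rule on toy data; the helper `flag_outliers`, the multiplier `k` and the vector `budget_toy` are assumptions and not part of the original analysis.

```r
# Hypothetical helper: TRUE for values outside [Q1 - k*IQR, Q3 + k*IQR]
flag_outliers <- function(x, k = 1.5) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  x < q[[1]] - k * iqr | x > q[[2]] + k * iqr
}

# Toy budgets in $million (illustrative values only)
budget_toy <- c(10, 12, 11, 13, 12, 500)
is_outlier <- flag_outliers(budget_toy)
is_outlier

# Flagged rows are kept in the data and can be examined separately
budget_toy[is_outlier]
```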

Adding relevant columns

# Creating 'profit' column
IMDB1$profit <- IMDB1$gross - IMDB1$budget
# Creating 'return_on_investment_perc' column
IMDB1$return_on_investment_perc <- (IMDB1$profit / IMDB1$budget) * 100
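As a worked example of the two formulas, a movie with a gross of 300 and a budget of 100 has a profit of 200 and a return on investment of 200%. The following sketch reproduces this on toy data; the data frame `movies_toy` is an assumption, and the guard against a zero budget is a precaution that the original code does not include.

```r
# Toy data (illustrative values only, same units for gross and budget)
movies_toy <- data.frame(
  gross  = c(300, 50, 80),
  budget = c(100, 100, 0)
)

# profit = gross - budget
movies_toy$profit <- movies_toy$gross - movies_toy$budget

# ROI in percent = profit / budget * 100; NA where the budget is not positive
movies_toy$return_on_investment_perc <- ifelse(
  movies_toy$budget > 0,
  movies_toy$profit / movies_toy$budget * 100,
  NA_real_
)
movies_toy
```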

Data Visualization

# Plotting Histogram for movie release year
movie_rels <- hist(IMDB1$title_year, breaks = 30, main = "Histogram of Movie Releases",
                   xlab = "Year movie was released", ylab = "Movie Count", col = "skyblue")

movie_rels
## $breaks
##  [1] 1920 1925 1930 1935 1940 1945 1950 1955 1960 1965 1970 1975 1980 1985 1990
## [16] 1995 2000 2005 2010 2015 2020
## 
## $counts
##  [1]   1   2   2   5   0   5   5   3  16  11  20  37  83 143 226 629 876 879 765
## [20]  60
## 
## $density
##  [1] 5.307856e-05 1.061571e-04 1.061571e-04 2.653928e-04 0.000000e+00
##  [6] 2.653928e-04 2.653928e-04 1.592357e-04 8.492569e-04 5.838641e-04
## [11] 1.061571e-03 1.963907e-03 4.405520e-03 7.590234e-03 1.199575e-02
## [16] 3.338641e-02 4.649682e-02 4.665605e-02 4.060510e-02 3.184713e-03
## 
## $mids
##  [1] 1922.5 1927.5 1932.5 1937.5 1942.5 1947.5 1952.5 1957.5 1962.5 1967.5
## [11] 1972.5 1977.5 1982.5 1987.5 1992.5 1997.5 2002.5 2007.5 2012.5 2017.5
## 
## $xname
## [1] "IMDB1$title_year"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

From the graph, it can be inferred that there are few records of movies released before 1980: the bins up to 1980 contain only 107 of the 3,768 movies (about 3%).

TOP 20 MOST PROFITABLE MOVIES

#Sort IMDB1 by profit in descending order
sorted_IMDB <- IMDB1[order(IMDB1$profit, decreasing = TRUE), ]
# Select the top 20 most profitable movies
top_20 <- head(sorted_IMDB, 20)
top_20
## # A tibble: 20 x 42
##    director_name     num_critic_for_reviews     gross actor_1_name   movie_title
##    <chr>                              <int>     <int> <chr>          <chr>      
##  1 James Cameron                        723 760505847 CCH Pounder    Avatar     
##  2 Colin Trevorrow                      644 652177271 Bryce Dallas ~ Jurassic W~
##  3 James Cameron                        315 658672302 Leonardo DiCa~ Titanic    
##  4 George Lucas                         282 460935665 Harrison Ford  Star Wars:~
##  5 Steven Spielberg                     215 434949459 Henry Thomas   E.T. the E~
##  6 Joss Whedon                          703 623279547 Chris Hemswor~ The Avenge~
##  7 Roger Allers                         186 422783777 Matthew Brode~ The Lion K~
##  8 George Lucas                         320 474544677 Natalie Portm~ Star Wars:~
##  9 Christopher Nolan                    645 533316061 Christian Bale The Dark K~
## 10 Gary Ross                            673 407999255 Jennifer Lawr~ The Hunger~
## 11 Tim Miller                           579 363024263 Ryan Reynolds  Deadpool   
## 12 Francis Lawrence                     502 424645577 Jennifer Lawr~ The Hunger~
## 13 Steven Spielberg                     308 356784000 Wayne Knight   Jurassic P~
## 14 Pierre Coffin                        306 368049635 Steve Carell   Despicable~
## 15 Clint Eastwood                       490 350123553 Bradley Cooper American S~
## 16 Andrew Stanton                       301 380838870 Alexander Gou~ Finding Ne~
## 17 Andrew Adamson                       205 436471036 Rupert Everett Shrek 2    
## 18 Peter Jackson                        328 377019252 Orlando Bloom  The Lord o~
## 19 Richard Marquand                     197 309125409 Harrison Ford  Star Wars:~
## 20 Robert Zemeckis                      149 329691196 Tom Hanks      Forrest Gu~
## # i 37 more variables: num_voted_users <int>, cast_total_facebook_likes <int>,
## #   facenumber_in_poster <int>, num_user_for_reviews <int>,
## #   content_rating <chr>, budget <dbl>, title_year <int>, imdb_score <dbl>,
## #   movie_facebook_likes <int>, Action <dbl>, Adventure <dbl>, Animation <dbl>,
## #   Biography <dbl>, Comedy <dbl>, Crime <dbl>, Documentary <dbl>, Drama <dbl>,
## #   Family <dbl>, Fantasy <dbl>, `Film-Noir` <dbl>, `Game-Show` <dbl>,
## #   History <dbl>, Horror <dbl>, Music <dbl>, Musical <dbl>, Mystery <dbl>, ...

Create scatter plot with regression line

ggplot(top_20, aes(x = budget / 1e6, y = profit / 1e6)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # label is mapped only in geom_text(), so geom_smooth() no longer drops it with a warning
  geom_text(aes(label = movie_title), hjust = -0.1, vjust = -0.5, size = 3) +
  labs(x = "Budget $million", y = "Profit $million", title = "Top 20 Profitable Movies") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Among these 20 movies, the plot suggests that higher-budget movies tend to earn more profit. The trend is roughly linear, with profit increasing as the budget increases.

THE 20 MOST PROFITABLE MOVIES BASED ON INVESTMENT

Sort the DataFrame by profit in descending order

sorted_IMDB1 <- IMDB1[order(IMDB1$profit, decreasing = TRUE), ]
top_20_1 <- head(sorted_IMDB1, 20)
top_20_1
## # A tibble: 20 x 42
##    director_name     num_critic_for_reviews     gross actor_1_name   movie_title
##    <chr>                              <int>     <int> <chr>          <chr>      
##  1 James Cameron                        723 760505847 CCH Pounder    Avatar     
##  2 Colin Trevorrow                      644 652177271 Bryce Dallas ~ Jurassic W~
##  3 James Cameron                        315 658672302 Leonardo DiCa~ Titanic    
##  4 George Lucas                         282 460935665 Harrison Ford  Star Wars:~
##  5 Steven Spielberg                     215 434949459 Henry Thomas   E.T. the E~
##  6 Joss Whedon                          703 623279547 Chris Hemswor~ The Avenge~
##  7 Roger Allers                         186 422783777 Matthew Brode~ The Lion K~
##  8 George Lucas                         320 474544677 Natalie Portm~ Star Wars:~
##  9 Christopher Nolan                    645 533316061 Christian Bale The Dark K~
## 10 Gary Ross                            673 407999255 Jennifer Lawr~ The Hunger~
## 11 Tim Miller                           579 363024263 Ryan Reynolds  Deadpool   
## 12 Francis Lawrence                     502 424645577 Jennifer Lawr~ The Hunger~
## 13 Steven Spielberg                     308 356784000 Wayne Knight   Jurassic P~
## 14 Pierre Coffin                        306 368049635 Steve Carell   Despicable~
## 15 Clint Eastwood                       490 350123553 Bradley Cooper American S~
## 16 Andrew Stanton                       301 380838870 Alexander Gou~ Finding Ne~
## 17 Andrew Adamson                       205 436471036 Rupert Everett Shrek 2    
## 18 Peter Jackson                        328 377019252 Orlando Bloom  The Lord o~
## 19 Richard Marquand                     197 309125409 Harrison Ford  Star Wars:~
## 20 Robert Zemeckis                      149 329691196 Tom Hanks      Forrest Gu~
## # i 37 more variables: num_voted_users <int>, cast_total_facebook_likes <int>,
## #   facenumber_in_poster <int>, num_user_for_reviews <int>,
## #   content_rating <chr>, budget <dbl>, title_year <int>, imdb_score <dbl>,
## #   movie_facebook_likes <int>, Action <dbl>, Adventure <dbl>, Animation <dbl>,
## #   Biography <dbl>, Comedy <dbl>, Crime <dbl>, Documentary <dbl>, Drama <dbl>,
## #   Family <dbl>, Fantasy <dbl>, `Film-Noir` <dbl>, `Game-Show` <dbl>,
## #   History <dbl>, Horror <dbl>, Music <dbl>, Musical <dbl>, Mystery <dbl>, ...

Create scatter plot with regression line and text labels

ggplot(data = top_20_1, aes(x = budget / 1e6, y = return_on_investment_perc)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  geom_text(aes(label = movie_title), hjust = 0, vjust = 0, size = 3, nudge_y = 0.2) +
  labs(x = "Budget ($million)", y = "Percent Return on Investment", title = "Top 20 Movies based on Return on Investment") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

This plot shows the Percentage Return on Investment ((profit/budget)*100) of the 20 most profitable movies. Note that the rows are sorted by profit, not by ROI, so these are not necessarily the 20 movies with the highest ROI in the dataset.

Absolute profit alone does not give a clear picture of a movie’s monetary success, because budgets differ widely between movies and over the years. Comparing the Return on Investment (ROI) against the budget gives a better picture. The ROI is high for low-budget films and decreases as the budget of the movie increases.
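Ranking by profit and ranking by ROI can give different orderings. The following is a minimal sketch on toy data; the data frame `movies_toy`, the titles and the values are assumptions made for illustration.

```r
# Toy data (illustrative titles and values only)
movies_toy <- data.frame(
  movie_title = c("Low budget hit", "Blockbuster", "Flop"),
  budget      = c(1e6, 200e6, 50e6),
  gross       = c(20e6, 700e6, 10e6),
  stringsAsFactors = FALSE
)
movies_toy$profit <- movies_toy$gross - movies_toy$budget
movies_toy$return_on_investment_perc <- movies_toy$profit / movies_toy$budget * 100

# Ranking by absolute profit
by_profit <- movies_toy[order(movies_toy$profit, decreasing = TRUE), ]

# Ranking by percentage return on investment
by_roi <- movies_toy[order(movies_toy$return_on_investment_perc, decreasing = TRUE), ]

by_profit$movie_title
by_roi$movie_title
```

On the real data, a ranking by ROI would order IMDB1 by `return_on_investment_perc` instead of `profit` before taking the first 20 rows.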

TOP 20 DIRECTORS WITH HIGHEST AVERAGE IMDB RATING

director_avg_imdb <- aggregate(imdb_score ~ director_name, data = IMDB1, FUN = mean) %>% arrange(desc(imdb_score))
head(director_avg_imdb,10)
##        director_name imdb_score
## 1    Charles Chaplin   8.600000
## 2          Tony Kaye   8.600000
## 3   Alfred Hitchcock   8.500000
## 4    Damien Chazelle   8.500000
## 5       Majid Majidi   8.500000
## 6         Ron Fricke   8.500000
## 7       Sergio Leone   8.433333
## 8  Christopher Nolan   8.425000
## 9     Asghar Farhadi   8.400000
## 10  Richard Marquand   8.400000

Selecting the top 20 directors with the highest average IMDb scores

top_20_directors <- head(director_avg_imdb, 20)
top_20_directors
##        director_name imdb_score
## 1    Charles Chaplin   8.600000
## 2          Tony Kaye   8.600000
## 3   Alfred Hitchcock   8.500000
## 4    Damien Chazelle   8.500000
## 5       Majid Majidi   8.500000
## 6         Ron Fricke   8.500000
## 7       Sergio Leone   8.433333
## 8  Christopher Nolan   8.425000
## 9     Asghar Farhadi   8.400000
## 10  Richard Marquand   8.400000
## 11    S.S. Rajamouli   8.400000
## 12      Billy Wilder   8.300000
## 13  Charles Ferguson   8.300000
## 14        Fritz Lang   8.300000
## 15       Lee Unkrich   8.300000
## 16  Lenny Abrahamson   8.300000
## 17       Pete Docter   8.233333
## 18    Hayao Miyazaki   8.225000
## 19        Elia Kazan   8.200000
## 20   George Roy Hill   8.200000

Creating a horizontal bar plot for the top directors

ggplot(top_20_directors, aes(x = imdb_score, y = reorder(director_name, imdb_score))) +
  geom_bar(stat = "identity", fill = "orange") +
  labs(x = "Average IMDb Score", y = "Director Name", title = "Top 20 Directors with Highest Average IMDb Scores") +
  theme_minimal()
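An average over very few movies can dominate a ranking like this one, since a director with a single highly rated film gets that film's score as the average. One possible refinement is to require a minimum number of movies per director. The following is a minimal sketch on toy data; the data frame `movies_toy` and the threshold `min_movies` are assumptions, not part of the original analysis.

```r
# Toy data standing in for IMDB1 (illustrative values only)
movies_toy <- data.frame(
  director_name = c("A", "A", "A", "B", "C", "C"),
  imdb_score    = c(8.0, 7.0, 7.5, 9.0, 6.0, 7.0),
  stringsAsFactors = FALSE
)

# Average score and number of movies per director
avg_score <- aggregate(imdb_score ~ director_name, data = movies_toy, FUN = mean)
n_movies  <- aggregate(imdb_score ~ director_name, data = movies_toy, FUN = length)
names(n_movies)[2] <- "n_movies"
director_stats <- merge(avg_score, n_movies, by = "director_name")

# Keep directors with at least min_movies movies (hypothetical threshold)
min_movies <- 2
eligible <- director_stats[director_stats$n_movies >= min_movies, ]
eligible <- eligible[order(eligible$imdb_score, decreasing = TRUE), ]
eligible
```

In this toy example director B has the highest average but only one movie, so B is excluded once the threshold is applied.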

TOP DIRECTORS BY TOTAL PROFIT

Calculating the total profit for each director

director_profit <- aggregate(profit ~ director_name, data = IMDB1, FUN = sum)%>%arrange(desc(profit))

Selecting the top 10 directors

top_directors <- head(director_profit, 10)
top_directors
##        director_name     profit
## 1   Steven Spielberg 2486332231
## 2       George Lucas 1386641480
## 3      James Cameron 1199625910
## 4     Chris Columbus  941707624
## 5         Tim Burton  824275480
## 6  Christopher Nolan  808227576
## 7      Peter Jackson  777968050
## 8        Jon Favreau  769381547
## 9   Francis Lawrence  755501971
## 10       Michael Bay  644242537

Creating a bar plot

ggplot(top_directors, aes(x = director_name, y = profit)) +
  geom_bar(stat = "identity", fill = "royalblue") +
  labs(x = "Director Name", y = "Total Profit", title = "Top Directors by Total Profit") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(labels = scales::comma)

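Thus it can be inferred that top directors can raise the prospect of profits but may not guarantee high ratings: only Christopher Nolan appears both among the top 10 directors by total profit and among the top 20 directors by average IMDb score.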
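The overlap between the two director rankings can be checked directly. The following sketch uses the names printed in the two outputs above; the vector names `top_by_profit` and `top_by_rating` are assumptions introduced for illustration.

```r
# Top 10 directors by total profit, as printed above
top_by_profit <- c("Steven Spielberg", "George Lucas", "James Cameron",
                   "Chris Columbus", "Tim Burton", "Christopher Nolan",
                   "Peter Jackson", "Jon Favreau", "Francis Lawrence",
                   "Michael Bay")

# Top 20 directors by average IMDb score, as printed above
top_by_rating <- c("Charles Chaplin", "Tony Kaye", "Alfred Hitchcock",
                   "Damien Chazelle", "Majid Majidi", "Ron Fricke",
                   "Sergio Leone", "Christopher Nolan", "Asghar Farhadi",
                   "Richard Marquand", "S.S. Rajamouli", "Billy Wilder",
                   "Charles Ferguson", "Fritz Lang", "Lee Unkrich",
                   "Lenny Abrahamson", "Pete Docter", "Hayao Miyazaki",
                   "Elia Kazan", "George Roy Hill")

# Directors that appear in both rankings
in_both <- intersect(top_by_profit, top_by_rating)
in_both
```

With the data frames from the analysis, the same check would be `intersect(top_directors$director_name, top_20_directors$director_name)`.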

SCATTER PLOT OF MOVIE FACEBOOK LIKES VS IMDB SCORE

ggplot(IMDB, aes(x = movie_facebook_likes, y = imdb_score, color = content_rating)) +
  geom_point(alpha = 0.7) +
  labs(x = "Movie Facebook Likes", y = "IMDb Score", title = "Scatter Plot of IMDb Score vs. Movie Facebook Likes") +
  theme_minimal()

Movies with extremely high Facebook likes tend to have higher IMDb scores, but the scores of movies with few Facebook likes vary over a very wide range.

COMPARISON BETWEEN NUM_CRITICS AND MOVIES_FACEBOOK_LIKES

ggplot(IMDB, aes(x = num_critic_for_reviews, y = movie_facebook_likes)) +
  geom_point(alpha = 0.5) +
  labs(x = "Number of Critic Reviews", y = "Movie Facebook Likes", title = "Comparison of Number of Critic Reviews vs. Movie Facebook Likes") +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal()

The plot suggests that movies reviewed by more critics tend to receive more Facebook likes. This is consistent with critic attention influencing public opinion, although a scatter plot alone cannot establish a causal link.

HEATMAP GIVING CORRELATION BETWEEN DIFFERENT COLUMNS IN THE IMDB DATA

IMDB2 <- IMDB1[, c('num_critic_for_reviews', 'cast_total_facebook_likes', 'num_user_for_reviews',
                   'title_year', 'movie_facebook_likes', 'num_voted_users', 'facenumber_in_poster', 'budget')]
corr <- cor(IMDB2)
corrplot(corr, method = "color", type = "full", tl.col = "black", tl.srt = 45,
         col = colorRampPalette(c("navy", "white", "firebrick3"))(100),
         title = "Correlation Heatmap")

Based on the heatmap, some predictors are highly correlated (correlation greater than 0.7). The highest correlation, 0.95, is between actor_1_facebook_likes and cast_total_facebook_likes; since actor_1_facebook_likes was dropped when creating IMDB1, this pair is not among the columns plotted above. There are also high correlations among num_voted_users, num_user_for_reviews and num_critic_for_reviews. We want to keep num_voted_users and take the ratio of num_user_for_reviews to num_critic_for_reviews.
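The highly correlated pairs can also be listed programmatically instead of being read off the heatmap, and the ratio feature can be computed in one line. The following is a minimal sketch on toy data; the data frame `movies_toy`, the 0.7 cut-off variable `threshold` and the column name `review_ratio` are assumptions made for illustration.

```r
# Toy data standing in for IMDB2 (illustrative values only)
movies_toy <- data.frame(
  num_voted_users        = c(100, 200, 300, 400, 500),
  num_user_for_reviews   = c(10, 22, 29, 41, 52),
  num_critic_for_reviews = c(5, 9, 16, 20, 24),
  facenumber_in_poster   = c(2, 0, 1, 0, 2)
)

corr_toy <- cor(movies_toy)

# Keep the lower triangle only, so each pair is listed once
corr_toy[upper.tri(corr_toy, diag = TRUE)] <- NA

# Pairs whose absolute correlation exceeds the threshold
threshold <- 0.7
high <- which(abs(corr_toy) > threshold, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(corr_toy)[high[, "row"]],
  var2 = colnames(corr_toy)[high[, "col"]],
  r    = corr_toy[high],
  stringsAsFactors = FALSE
)
high_pairs

# Ratio of user reviews to critic reviews (hypothetical column name)
movies_toy$review_ratio <- movies_toy$num_user_for_reviews /
  movies_toy$num_critic_for_reviews
```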

PROFIT VS IMDB SCORE

ggplot(IMDB1, aes(x = profit, y = imdb_score)) +
  geom_point(alpha = 0.5) +
  labs(x = "Profit", y = "IMDb Score", title = "Scatter Plot of Profit vs. IMDb Score") +
  theme_minimal()

It can be inferred that highly rated movies have a higher chance of incurring losses. The prospect of a loss is higher for high-budget movies with an IMDb score above 6. The heatmap shows that num_critic_for_reviews and movie_facebook_likes are highly correlated, so we analyse them further.

CRITIC REVIEW VS IMDB SCORE

ggplot(IMDB, aes(x = num_critic_for_reviews, y = imdb_score)) +
  geom_point(alpha = 0.5) +
  labs(x = "Number of Critic Reviews", y = "IMDb Score", title = "Scatter Plot of Number of Critic Reviews vs. IMDb Score") +
  theme_minimal()

It can be inferred that there is a positive relationship: as the number of critic reviews rises, the IMDb score tends to be higher.

One Hot Encoding

Since the IMDB dataset has already been imported and normalised above, we do not need to import it again.

OHE_IMDB <- IMDB

Identifying the Categorical Variables

categorical_vars <- c("color", "director_name", "actor_1_name", "actor_2_name", "actor_3_name", "language", "country", "content_rating")

Performing The One Hot Encoding

# Convert each categorical column to a factor and then to its integer level code.
# Note: this yields one integer-coded column per variable (label encoding),
# not one 0/1 indicator column per level.
encoded_data <- OHE_IMDB %>%
  select(all_of(categorical_vars)) %>%
  mutate(across(everything(), ~ as.numeric(as.factor(.x)))) %>%
  as.data.frame()
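For comparison, the sketch below contrasts integer (label) encoding with indicator-column (one-hot) encoding in base R. The two-column data frame `demo` is a made-up example, and `model.matrix()` is used here as one possible way to build indicator columns; it is not the method used in this report.

```r
# Tiny made-up example with two categorical columns.
demo <- data.frame(
  color          = c("Color", "Black and White", "Color", "Color"),
  content_rating = c("PG-13", "R", "PG", "R"),
  stringsAsFactors = TRUE
)

# Label encoding: one integer column per variable,
# with codes following the alphabetical level order.
label_encoded <- as.data.frame(lapply(demo, as.integer))
print(label_encoded)

# One-hot encoding: one 0/1 column per level.
# contrasts = FALSE keeps every level of every factor;
# "- 1" drops the intercept column.
one_hot <- model.matrix(
  ~ . - 1,
  data = demo,
  contrasts.arg = lapply(demo, contrasts, contrasts = FALSE)
)
print(one_hot)
```

Label codes impose an arbitrary ordering on names such as directors or actors, which matters for distance-based or variance-based methods. Indicator columns avoid that, at the cost of one column per level.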

Principal Component Analysis

library('corrr')
library('ggcorrplot')
library("FactoMineR")
library('caret')
library('factoextra')

Removing Null Values

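# Drop rows that contain missing values
imdb_clean <- na.omit(encoded_data)
# dummyVars() only expands factor/character columns; every column here is
# already numeric, so trsf has the same columns as imdb_clean
dmy <- dummyVars(" ~.", data = imdb_clean)
trsf <- data.frame(predict(dmy, newdata = imdb_clean))
head(trsf,10)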
##    color director_name actor_1_name actor_2_name actor_3_name language country
## 1      3           634          220         1024         2569       12      45
## 2      3           549          699         1621         1015       12      45
## 3      3          1426          263         1826         2332       12      44
## 4      3           258         1366          393         1287       12      45
## 5      3            67          327         1871         1998       12      45
## 6      3          1429          564          900         1434       12      45
## 7      3          1149          163          592         1565       12      45
## 8      3           856          249         1789         2244       12      45
## 9      3           372           24          481         2187       12      44
## 10     3          1691          530         1246           35       12      45
##    content_rating
## 1              10
## 2              10
## 3              10
## 4              10
## 5              10
## 6              10
## 7               9
## 8              10
## 9               9
## 10             10
colnames(imdb_clean)
## [1] "color"          "director_name"  "actor_1_name"   "actor_2_name"  
## [5] "actor_3_name"   "language"       "country"        "content_rating"
numerical_data <- imdb_clean[,1:8]
head(numerical_data)
##   color director_name actor_1_name actor_2_name actor_3_name language country
## 1     3           634          220         1024         2569       12      45
## 2     3           549          699         1621         1015       12      45
## 3     3          1426          263         1826         2332       12      44
## 4     3           258         1366          393         1287       12      45
## 5     3            67          327         1871         1998       12      45
## 6     3          1429          564          900         1434       12      45
##   content_rating
## 1             10
## 2             10
## 3             10
## 4             10
## 5             10
## 6             10

Normalizing the data

data_normalized <- scale(numerical_data)
head(data_normalized)
##      color director_name actor_1_name actor_2_name actor_3_name   language
## 1 0.187521    -0.4885207  -1.22641097   -0.1326478   1.63987050 -0.1445761
## 2 0.187521    -0.6602714  -0.07138804    0.7872096  -0.40001098 -0.1445761
## 3 0.187521     1.1117923  -1.12272415    1.1030735   1.32876889 -0.1445761
## 4 0.187521    -1.2482652   1.53696330   -1.1048924  -0.04296609 -0.1445761
## 5 0.187521    -1.6341993  -0.96839958    1.1724095   0.89033876 -0.1445761
## 6 0.187521     1.1178541  -0.39691642   -0.3237070   0.14999567 -0.1445761
##     country content_rating
## 1 0.3613929   -0.008526776
## 2 0.3613929   -0.008526776
## 3 0.2608147   -0.008526776
## 4 0.3613929   -0.008526776
## 5 0.3613929   -0.008526776
## 6 0.3613929   -0.008526776

Correlation matrix of all the variables

corr_matrix <- cor(data_normalized)
ggcorrplot(corr_matrix)

Forming the principal components

data.pca <- princomp(corr_matrix)
summary(data.pca)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3    Comp.4    Comp.5
## Standard deviation     0.4281121 0.3765415 0.3623430 0.3444055 0.3410672
## Proportion of Variance 0.2055552 0.1590154 0.1472493 0.1330313 0.1304648
## Cumulative Proportion  0.2055552 0.3645706 0.5118199 0.6448512 0.7753160
##                           Comp.6     Comp.7       Comp.8
## Standard deviation     0.3372233 0.29430640 1.580507e-08
## Proportion of Variance 0.1275407 0.09714332 2.801601e-16
## Cumulative Proportion  0.9028567 1.00000000 1.000000e+00
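A note on the input to `princomp()`: its first argument is read as a data matrix, so a correlation matrix passed there is treated as a small set of observations (one per row). That is one likely reason the last component above has essentially zero variance. The sketch below, on synthetic data, shows two equivalent ways to run the PCA on the variables themselves: from the scaled observations, or from the correlation matrix through the `covmat` argument. The matrix `x` and its column names are made up for illustration.

```r
set.seed(42)
# Synthetic stand-in: 100 observations of 4 variables (values are made up).
x <- matrix(rnorm(400), ncol = 4,
            dimnames = list(NULL, c("a", "b", "c", "d")))
x[, "b"] <- x[, "a"] + rnorm(100, sd = 0.3)  # make two columns correlated
x_scaled <- scale(x)

# PCA computed from the observations.
pca_data <- princomp(x_scaled)

# The same PCA specified through the correlation matrix.
pca_cov <- princomp(covmat = cor(x_scaled))

# Proportion of variance explained by each component.
prop_var <- function(p) p$sdev^2 / sum(p$sdev^2)
print(round(prop_var(pca_data), 4))
print(round(prop_var(pca_cov), 4))
```

Both calls give the same proportions of variance. The first form also returns scores for each observation, which the `covmat` form cannot provide.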

Scree Plot

fviz_eig(data.pca, addlabels = TRUE)

Biplot of the attributes

# Graph of the variables
fviz_pca_var(data.pca, col.var = "black")

Quality of representation (cos2) of each variable

fviz_cos2(data.pca, choice = "var", axes = 1:2)

Biplot combined with cos2

fviz_pca_var(data.pca, col.var = "cos2",
            gradient.cols = c("black", "orange", "green"),
            repel = TRUE)

Time Series Analysis

Loading Relevant Libraries

library(tidyverse)
library(readr)
library(dplyr)
library(snakecase)

Loading the data set

Superstore <- read_csv("C:\\Users\\chinmay.Jain\\Desktop\\R\\Superstore.csv")
## Rows: 9800 Columns: 18
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (15): Order ID, Order Date, Ship Date, Ship Mode, Customer ID, Customer ...
## dbl  (3): Row ID, Postal Code, Sales
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Viewing the first 6 entries

head(Superstore)
## # A tibble: 6 x 18
##   `Row ID` `Order ID`     `Order Date` `Ship Date` `Ship Mode`    `Customer ID`
##      <dbl> <chr>          <chr>        <chr>       <chr>          <chr>        
## 1      541 CA-2015-140795 1/2/2015     3/2/2015    First Class    BD-11500     
## 2      158 CA-2015-104269 1/3/2015     6/3/2015    Second Class   DB-13060     
## 3     5714 US-2015-143707 1/3/2015     5/3/2015    Standard Class HR-14770     
## 4     6548 CA-2015-113880 1/3/2015     5/3/2015    Standard Class VF-21715     
## 5     6549 CA-2015-113880 1/3/2015     5/3/2015    Standard Class VF-21715     
## 6     7948 CA-2015-131009 1/3/2015     5/3/2015    Standard Class SC-20380     
## # i 12 more variables: `Customer Name` <chr>, Segment <chr>, Country <chr>,
## #   City <chr>, State <chr>, `Postal Code` <dbl>, Region <chr>,
## #   `Product ID` <chr>, Category <chr>, `Sub-Category` <chr>,
## #   `Product Name` <chr>, Sales <dbl>

Converting the headers to snake case

colnames(Superstore)<-to_snake_case(colnames(Superstore))
colnames(Superstore)
##  [1] "row_id"        "order_id"      "order_date"    "ship_date"    
##  [5] "ship_mode"     "customer_id"   "customer_name" "segment"      
##  [9] "country"       "city"          "state"         "postal_code"  
## [13] "region"        "product_id"    "category"      "sub_category" 
## [17] "product_name"  "sales"

Converting ‘order_date’ column to proper date format

Superstore$order_date <- as.Date(Superstore$order_date, format = "%d/%m/%Y")

Converting ‘ship_date’ column to proper date format

Superstore$ship_date <- as.Date(Superstore$ship_date, format = "%d/%m/%Y")
head(Superstore)
## # A tibble: 6 x 18
##   row_id order_id      order_date ship_date  ship_mode customer_id customer_name
##    <dbl> <chr>         <date>     <date>     <chr>     <chr>       <chr>        
## 1    541 CA-2015-1407~ 2015-02-01 2015-02-03 First Cl~ BD-11500    Bradley Druc~
## 2    158 CA-2015-1042~ 2015-03-01 2015-03-06 Second C~ DB-13060    Dave Brooks  
## 3   5714 US-2015-1437~ 2015-03-01 2015-03-05 Standard~ HR-14770    Hallie Redmo~
## 4   6548 CA-2015-1138~ 2015-03-01 2015-03-05 Standard~ VF-21715    Vicky Freyma~
## 5   6549 CA-2015-1138~ 2015-03-01 2015-03-05 Standard~ VF-21715    Vicky Freyma~
## 6   7948 CA-2015-1310~ 2015-03-01 2015-03-05 Standard~ SC-20380    Shahid Colli~
## # i 11 more variables: segment <chr>, country <chr>, city <chr>, state <chr>,
## #   postal_code <dbl>, region <chr>, product_id <chr>, category <chr>,
## #   sub_category <chr>, product_name <chr>, sales <dbl>

Grouping by the product category and order_date

base <- Superstore %>%
  group_by(order_date,category) %>%
  summarise(sales = sum(sales)) 
## `summarise()` has grouped output by 'order_date'. You can override using the
## `.groups` argument.
head(base)
## # A tibble: 6 x 3
## # Groups:   order_date [4]
##   order_date category         sales
##   <date>     <chr>            <dbl>
## 1 2015-01-03 Office Supplies   16.4
## 2 2015-01-04 Office Supplies  288. 
## 3 2015-01-05 Office Supplies   19.5
## 4 2015-01-06 Furniture       2574. 
## 5 2015-01-06 Office Supplies  685. 
## 6 2015-01-06 Technology      1148.

Extracting year, month, and quarter from the order_date column

base$Year <- format(base$order_date, "%Y")
base$Month <- format(base$order_date, "%m")
base$Quarter <- quarters(base$order_date)

Aggregating sales at yearly level product category-wise

yearly_sales <- aggregate(sales ~ Year + category, data = base, sum)

Creating time series object for each year

salests_yearly <- ts(yearly_sales$sales, start = c(min(base$Year)))
head(salests_yearly)
## [1] 156477.9 164053.9 195813.0 212313.8 149512.8 133124.4

Aggregating sales for each quarter of each year product category-wise

quarterly_sales <- aggregate(sales ~ Year + Quarter + category, data = base, sum)

Converting to ts object for quarterly sales of each year

salests_quarterly <- ts(quarterly_sales$sales, frequency = 4)
head(salests_quarterly)
## [1] 22300.30 23596.95 23820.96 23597.98 28002.21 27391.62

Aggregating sales at monthly level product category-wise

monthly_sales <- aggregate(sales ~ Year + Month + category, data = base, sum)

Creating time series object for 12 months of each year

salests_monthly <- ts(monthly_sales$sales, frequency = 12, start = c(2015, 1))
head(salests_monthly)
## [1]  6217.277 11739.942  7622.743  5930.162  1839.658  3134.374

Creating time series object for sales for each day

salests_daily <- ts(base$sales)
head(salests_daily)
## [1]   16.448  288.060   19.536 2573.820  685.340 1147.940
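The yearly, quarterly and monthly tables above hold one row per period and category, so a single `ts()` built from their sales column stacks all categories into one series. `aggregate()` also returns rows with the first grouping variable varying fastest, so those rows are not in calendar order. A possible alternative, sketched here on synthetic data, is to sort the table and build one series per category. The data frame `demo`, its values, and the list name `ts_by_category` are assumptions for illustration, and the sketch assumes every month is present for every category.

```r
# Synthetic stand-in for the aggregated table (sales values are made up).
demo <- expand.grid(
  Month    = sprintf("%02d", 1:12),
  Year     = as.character(2015:2016),
  category = c("Furniture", "Technology"),
  stringsAsFactors = FALSE
)
set.seed(7)
demo$sales <- round(runif(nrow(demo), 1000, 5000), 2)

monthly <- aggregate(sales ~ Year + Month + category, data = demo, sum)

# Sort explicitly by category, year and month so each series is chronological.
monthly <- monthly[order(monthly$category, monthly$Year, monthly$Month), ]

# One monthly ts object per category, starting in January of the first year.
ts_by_category <- lapply(
  split(monthly, monthly$category),
  function(d) ts(d$sales,
                 start = c(as.integer(min(d$Year)), 1),
                 frequency = 12)
)

print(ts_by_category$Furniture)
```

Each element of the list can then be plotted, decomposed or modelled separately, for example with `plot.ts(ts_by_category$Furniture)`.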

Plotting Graphs for the daily, monthly, quarterly and yearly sales

plot.ts(salests_daily)

plot.ts(salests_monthly)

plot.ts(salests_quarterly)

plot.ts(salests_yearly)

Transforming to log time series

logdaily<-log(salests_daily)
logmonthly<-log(salests_monthly)
logquarterly<-log(salests_quarterly)
logyearly<-log(salests_yearly)

Plotting Graphs for time series transformed using log

plot.ts(logdaily)

plot.ts(logmonthly)

plot.ts(logquarterly)

plot.ts(logyearly)

Decomposing Time Series

library("TTR")

Simple Moving Average

plot.ts(SMA(salests_daily))

plot.ts(SMA(salests_monthly))

plot.ts(SMA(salests_quarterly))

plot.ts(SMA(salests_yearly))
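As a rough illustration of what a simple moving average computes, the sketch below reproduces a trailing average with base R's `stats::filter()` on a short made-up series. The window length `n = 3` is chosen only to keep the example readable; `SMA()` from TTR uses `n = 10` when no window is given, which leaves few non-missing values for a short series.

```r
# Short made-up series.
x <- ts(c(5, 7, 6, 9, 12, 10, 14, 13, 17, 16, 20, 19))

# Trailing moving average over the last n points.
# The stats:: prefix is used because packages such as dplyr also export filter().
n <- 3
sma_3 <- stats::filter(x, rep(1 / n, n), sides = 1)

# The first n - 1 values are NA because the window is not yet full.
print(sma_3)
```

A wider window gives a smoother curve but discards more points at the start of the series.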

Decomposing Seasonal Data

salests_monthlyd<- decompose(salests_monthly)
salests_quarterlyd<- decompose(salests_quarterly)
plot(salests_monthlyd)

plot(salests_quarterlyd)

ARIMA Models

Daily Sales

dailydiff1 <- diff(salests_daily, differences=1)
plot.ts(dailydiff1)

dailydiff2 <- diff(salests_daily, differences=2)
plot.ts(dailydiff2)

Monthly Sales

monthlydiff1 <- diff(salests_monthly, differences=1)
plot.ts(monthlydiff1)

monthlydiff2 <- diff(salests_monthly, differences=2)
plot.ts(monthlydiff2)

Quarterly Sales

quaterlydiff1 <- diff(salests_quarterly, differences=1)
plot.ts(quaterlydiff1)

quaterlydiff2 <- diff(salests_quarterly, differences=2)
plot.ts(quaterlydiff2)

Yearly Sales

yearlydiff1 <- diff(salests_yearly, differences=1)
plot.ts(yearlydiff1)
yearlydiff2 <- diff(salests_yearly, differences=2)
plot.ts(yearlydiff2)
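The effect of the `differences` argument can be checked on a small made-up vector. This is a sketch for intuition only: first differences remove a linear trend, and second differences remove a quadratic one.

```r
# Made-up series with a quadratic trend.
x <- c(10, 12, 15, 19, 24, 30)

d1 <- diff(x, differences = 1)  # change between consecutive values
d2 <- diff(x, differences = 2)  # change of the change

print(d1)
print(d2)

# differences = 2 is the same as differencing twice.
identical(d2, diff(diff(x)))
```

Each round of differencing shortens the series by one observation, which matters for very short series such as the yearly one.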

Selecting a Candidate ARIMA Model

Daily

acf(dailydiff1, lag.max=20)

acf(dailydiff1, lag.max=20, plot=FALSE)
## 
## Autocorrelations of series 'dailydiff1', by lag
## 
##      0      1      2      3      4      5      6      7      8      9     10 
##  1.000 -0.491 -0.005 -0.010  0.001  0.006  0.007  0.013 -0.034 -0.005  0.048 
##     11     12     13     14     15     16     17     18     19     20 
## -0.038  0.016 -0.019  0.005  0.007 -0.005  0.028 -0.041  0.018  0.005

Monthly

acf(monthlydiff1, lag.max=20)

acf(monthlydiff1, lag.max=20, plot=FALSE)
## 
## Autocorrelations of series 'monthlydiff1', by lag
## 
## 0.0000 0.0833 0.1667 0.2500 0.3333 0.4167 0.5000 0.5833 0.6667 0.7500 0.8333 
##  1.000 -0.332 -0.190  0.064  0.043  0.075 -0.250  0.041  0.137 -0.068 -0.115 
## 0.9167 1.0000 1.0833 1.1667 1.2500 1.3333 1.4167 1.5000 1.5833 1.6667 
##  0.031  0.204 -0.058 -0.112 -0.041  0.234 -0.095 -0.154  0.029  0.050

Quarterly

acf(quaterlydiff1, lag.max=20)

acf(quaterlydiff1, lag.max=20, plot=FALSE)
## 
## Autocorrelations of series 'quaterlydiff1', by lag
## 
##   0.00   0.25   0.50   0.75   1.00   1.25   1.50   1.75   2.00   2.25   2.50 
##  1.000 -0.126 -0.429  0.089  0.278 -0.059 -0.247  0.040  0.142  0.037 -0.219 
##   2.75   3.00   3.25   3.50   3.75   4.00   4.25   4.50   4.75   5.00 
## -0.091  0.202  0.030 -0.233 -0.127  0.350  0.099 -0.204 -0.057  0.197

Yearly

acf(yearlydiff2, lag.max=20)

acf(yearlydiff2, lag.max=20, plot=FALSE)
## 
## Autocorrelations of series 'yearlydiff2', by lag
## 
##      0      1      2      3      4      5      6      7      8      9 
##  1.000 -0.113 -0.661 -0.017  0.486  0.005 -0.265  0.029  0.045 -0.009
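To read the autocorrelation tables above, it can help to compare each value with the approximate 95% bound of plus or minus 1.96 / sqrt(n) that `acf()` draws on its plots. The sketch below does this on a synthetic series built by differencing white noise, which gives a lag-1 autocorrelation close to -0.5 with the other lags near zero. The series `y` and all its values are made up.

```r
set.seed(3)
# Differencing uncorrelated noise produces a lag-1 autocorrelation near -0.5.
e <- rnorm(501)
y <- diff(e)

r <- acf(y, lag.max = 20, plot = FALSE)

# Approximate 95% bound under the hypothesis of no autocorrelation.
bound <- qnorm(0.975) / sqrt(length(y))

lags <- r$lag[-1]   # drop lag 0, which is always 1
vals <- r$acf[-1]
significant <- lags[abs(vals) > bound]

print(round(bound, 3))
print(significant)
```

A single large negative spike at lag 1 after differencing, as in this synthetic case, is often taken as a hint that the original series was already close to uncorrelated, or that a first-order moving-average term is worth trying.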

ARIMA

library(forecast)
daily_a <- auto.arima(salests_daily)
monthly_a<-auto.arima(salests_monthly)
quaterly_a<-auto.arima(salests_quarterly)
yearly_a<-auto.arima(salests_yearly)

daily_a
## Series: salests_daily 
## ARIMA(1,1,2) 
## 
## Coefficients:
##          ar1      ma1     ma2
##       0.5058  -1.4638  0.4698
## s.e.  0.4697   0.4713  0.4658
## 
## sigma^2 = 1655252:  log likelihood = -24346.61
## AIC=48701.21   AICc=48701.22   BIC=48725.01
monthly_a
## Series: salests_monthly 
## ARIMA(0,1,2)(1,0,2)[12] 
## 
## Coefficients:
##           ma1      ma2     sar1    sma1    sma2
##       -0.5652  -0.1799  -0.8527  1.1133  0.3979
## s.e.   0.0874   0.0922   0.1364  0.1560  0.0965
## 
## sigma^2 = 57131580:  log likelihood = -1480.7
## AIC=2973.41   AICc=2974.03   BIC=2991.19
quaterly_a
## Series: salests_quarterly 
## ARIMA(0,1,0) 
## 
## sigma^2 = 347299618:  log likelihood = -528.83
## AIC=1059.67   AICc=1059.76   BIC=1061.52
yearly_a
## Series: salests_yearly 
## ARIMA(0,0,0) with non-zero mean 
## 
## Coefficients:
##            mean
##       188461.40
## s.e.   11217.56
## 
## sigma^2 = 1.647e+09:  log likelihood = -143.84
## AIC=291.68   AICc=293.01   BIC=292.65
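Once a model is chosen, it can be used to forecast. The sketch below uses base R's `arima()` and `predict()` on a synthetic monthly series so that it runs without extra packages. The series `y`, the order `c(0, 1, 1)` and the 6-step horizon are assumptions for illustration, not results from the Superstore data. With the forecast package loaded, `forecast()` applied to a fitted `auto.arima()` object plays a similar role.

```r
set.seed(11)
# Synthetic monthly series with an upward drift (values are made up).
y <- ts(cumsum(rnorm(60, mean = 1)), frequency = 12, start = c(2015, 1))

# Fit an ARIMA(0,1,1) model: one difference and one moving-average term.
fit <- arima(y, order = c(0, 1, 1))
print(fit)

# Point forecasts and standard errors for the next 6 months.
fc <- predict(fit, n.ahead = 6)
print(fc$pred)
print(fc$se)

# Approximate 95% prediction interval.
lower <- fc$pred - qnorm(0.975) * fc$se
upper <- fc$pred + qnorm(0.975) * fc$se
```

The standard errors grow with the horizon, so the interval widens the further ahead the forecast goes.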